Bayesian Exploration Networks
Bayesian reinforcement learning (RL) offers a principled and elegant approach
for sequential decision making under uncertainty. Most notably, Bayesian agents
do not face an exploration/exploitation dilemma, a major pathology of
frequentist methods. A key challenge for Bayesian RL is the computational
complexity of learning Bayes-optimal policies, which is only tractable in toy
domains. In this paper we propose a novel model-free approach to address this
challenge. Rather than modelling uncertainty in high-dimensional state
transition distributions as model-based approaches do, we model uncertainty in
a one-dimensional Bellman operator. Our theoretical analysis reveals that
existing model-free approaches either do not propagate epistemic uncertainty
through the MDP or optimise over a set of contextual policies instead of all
history-conditioned policies. Both approximations yield policies that can be
arbitrarily Bayes-suboptimal. To overcome these issues, we introduce the
Bayesian exploration network (BEN) which uses normalising flows to model both
the aleatoric uncertainty (via density estimation) and epistemic uncertainty
(via variational inference) in the Bellman operator. In the limit of complete
optimisation, BEN learns true Bayes-optimal policies, but like in variational
expectation-maximisation, partial optimisation renders our approach tractable.
Empirical results demonstrate that BEN can learn true Bayes-optimal policies in
tasks where existing model-free approaches fail.
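For intuition about density-estimating a one-dimensional Bellman operator with a normalising flow, the following minimal sketch (ours, not the authors' code; the feature dimensions, network shapes, and data are assumptions) fits a single conditional affine flow to scalar Bellman targets. BEN itself stacks richer flow layers and adds a variational posterior over flow parameters for epistemic uncertainty.

```python
import torch
import torch.nn as nn

class ConditionalAffineFlow(nn.Module):
    """One-dimensional conditional normalising flow (single affine layer).

    Models the distribution of scalar Bellman targets
    b = r + gamma * max_a' Q(s', a'), conditioned on (s, a) features,
    i.e. the aleatoric uncertainty in the one-dimensional Bellman
    operator. Illustrative sketch only.
    """

    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        # Conditioner network: maps (s, a) features to shift and log-scale.
        self.conditioner = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2)
        )

    def log_prob(self, target: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        shift, log_scale = self.conditioner(feats).chunk(2, dim=-1)
        # Invert the affine map: z = (b - shift) / exp(log_scale).
        z = (target - shift) * torch.exp(-log_scale)
        base = torch.distributions.Normal(0.0, 1.0)
        # Change of variables: log p(b) = log p(z) - log_scale.
        return base.log_prob(z) - log_scale

# Training step (sketch): maximise the likelihood of observed targets.
flow = ConditionalAffineFlow(feat_dim=8)
feats = torch.randn(32, 8)    # hypothetical (s, a) features
targets = torch.randn(32, 1)  # hypothetical Bellman targets
loss = -flow.log_prob(targets, feats).mean()
loss.backward()
```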
Perfectly Secure Steganography Using Minimum Entropy Coupling
Steganography is the practice of encoding secret information into innocuous
content in such a manner that an adversarial third party would not realize that
there is hidden meaning. While this problem has classically been studied in
security literature, recent advances in generative models have led to a shared
interest among security and machine learning researchers in developing scalable
steganography techniques. In this work, we show that a steganography procedure
is perfectly secure under Cachin (1998)'s information-theoretic model of
steganography if and only if it is induced by a coupling. Furthermore, we show
that, among perfectly secure procedures, a procedure maximizes information
throughput if and only if it is induced by a minimum entropy coupling. These
insights yield what are, to the best of our knowledge, the first steganography
algorithms to achieve perfect security guarantees for arbitrary covertext
distributions. To provide empirical validation, we compare a minimum entropy
coupling-based approach to three modern baselines -- arithmetic coding, Meteor,
and adaptive dynamic grouping -- using GPT-2, WaveRNN, and Image Transformer as
communication channels. We find that the minimum entropy coupling-based
approach achieves superior encoding efficiency, despite its stronger security
constraints. In aggregate, these results suggest that it may be natural to view
information-theoretic steganography through the lens of minimum entropy
coupling.
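As a rough illustration of the central object, the sketch below (ours, not the paper's implementation) computes the standard greedy approximation to a minimum entropy coupling (Kocaoglu et al., 2017) between two marginals. In coupling-based steganography, the two marginals would be the message distribution and the covertext (channel) distribution; sampling covertext from the coupling conditioned on the message leaves the covertext marginal exactly unchanged, which is what perfect security requires.

```python
import heapq

def greedy_min_entropy_coupling(p, q):
    """Greedy approximation to a minimum entropy coupling of marginals
    p and q: repeatedly pair the largest remaining probability mass in
    each marginal and assign their minimum as joint mass.
    Illustrative sketch only.
    """
    # Max-heaps of (mass, index) for each marginal (negated for heapq).
    hp = [(-mass, i) for i, mass in enumerate(p)]
    hq = [(-mass, j) for j, mass in enumerate(q)]
    heapq.heapify(hp)
    heapq.heapify(hq)
    coupling = {}  # (i, j) -> joint probability mass
    while hp and hq:
        mp, i = heapq.heappop(hp)
        mq, j = heapq.heappop(hq)
        mp, mq = -mp, -mq
        m = min(mp, mq)
        coupling[(i, j)] = coupling.get((i, j), 0.0) + m
        # Push back whatever mass remains in each marginal.
        if mp - m > 1e-12:
            heapq.heappush(hp, (-(mp - m), i))
        if mq - m > 1e-12:
            heapq.heappush(hq, (-(mq - m), j))
    return coupling

# Example: couple a uniform 2-symbol message with a skewed 3-token channel.
print(greedy_min_entropy_coupling([0.5, 0.5], [0.6, 0.3, 0.1]))
```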
Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning
In many real-world settings, a team of agents must coordinate its behaviour
while acting in a decentralised fashion. At the same time, it is often possible
to train the agents in a centralised fashion where global state information is
available and communication constraints are lifted. Learning joint
action-values conditioned on extra state information is an attractive way to
exploit centralised learning, but the best strategy for then extracting
decentralised policies is unclear. Our solution is QMIX, a novel value-based
method that can train decentralised policies in a centralised end-to-end
fashion. QMIX employs a mixing network that estimates joint action-values as a
monotonic combination of per-agent values. We enforce this monotonicity
structurally, through non-negative weights in the mixing network, which
guarantees consistency between the centralised and decentralised policies. To
evaluate the performance
of QMIX, we propose the StarCraft Multi-Agent Challenge (SMAC) as a new
benchmark for deep multi-agent reinforcement learning. We evaluate QMIX on a
challenging set of SMAC scenarios and show that it significantly outperforms
existing multi-agent reinforcement learning methods. (Extended version of the ICML 2018 conference paper, arXiv:1803.11485.)
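The monotonicity constraint described above is easy to see in code. The following sketch (ours, not the authors' implementation; layer sizes are assumptions) shows a QMIX-style mixer whose weights are produced by hypernetworks conditioned on the global state, with an absolute value keeping every mixing weight non-negative:

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """Sketch of a QMIX-style mixing network. Hypernetworks map the
    global state to the mixer's weights; taking their absolute value
    keeps every weight non-negative, so Q_tot is monotonically
    non-decreasing in each agent's utility.
    """

    def __init__(self, n_agents: int, state_dim: int, embed: int = 32):
        super().__init__()
        self.n_agents, self.embed = n_agents, embed
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed)
        self.hyper_w2 = nn.Linear(state_dim, embed)
        self.hyper_b1 = nn.Linear(state_dim, embed)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed),
                                      nn.ReLU(), nn.Linear(embed, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed)
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed, 1)
        b1 = self.hyper_b1(state).unsqueeze(1)
        hidden = torch.relu(agent_qs.unsqueeze(1) @ w1 + b1)      # (batch, 1, embed)
        return (hidden @ w2).squeeze(-1) + self.hyper_b2(state)  # Q_tot
```

Because the state enters only through the hypernetworks (and biases), Q_tot can depend on global information in arbitrary ways while remaining monotonic in the per-agent values.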
Revealing Robust Oil and Gas Company Macro-Strategies using Deep Multi-Agent Reinforcement Learning
The energy transition potentially poses an existential risk for major
international oil companies (IOCs) if they fail to adapt to low-carbon business
models. Projections of energy futures, however, rest on diverging assumptions
about the transition's scale and pace, causing disagreement among IOC
decision-makers and their stakeholders over what the business model of an
incumbent fossil fuel company should be. In this work, we used deep multi-agent
reinforcement learning to solve an energy systems wargame wherein players
simulate IOC decision-making, including hydrocarbon and low-carbon investment
decisions, dividend policies, and capital structure measures, through an
uncertain energy transition to explore critical and non-linear governance
questions, from leveraged transitions to reserve replacements. Adversarial play
facilitated by state-of-the-art algorithms revealed decision-making strategies
robust to energy transition uncertainty and against multiple IOCs. In all
games, robust strategies emerged in the form of low-carbon business models as a
result of early transition-oriented movement. IOCs adopting such strategies
outperformed business-as-usual and delayed transition strategies regardless of
hydrocarbon demand projections. In addition to maximizing value, these
strategies benefit greater society by contributing substantial amounts of
capital necessary to accelerate the global low-carbon energy transition. Our
findings point towards the need for lenders and investors to effectively
mobilize transition-oriented finance and engage with IOCs to ensure responsible
reallocation of capital towards low-carbon business models that would enable
the emergence of fossil fuel incumbents as future low-carbon leaders.
QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning
In many real-world settings, a team of agents must coordinate their behaviour
while acting in a decentralised way. At the same time, it is often possible to
train the agents in a centralised fashion in a simulated or laboratory setting,
where global state information is available and communication constraints are
lifted. Learning joint action-values conditioned on extra state information is
an attractive way to exploit centralised learning, but the best strategy for
then extracting decentralised policies is unclear. Our solution is QMIX, a
novel value-based method that can train decentralised policies in a centralised
end-to-end fashion. QMIX employs a network that estimates joint action-values
as a complex non-linear combination of per-agent values that condition only on
local observations. We structurally enforce that the joint-action value is
monotonic in the per-agent values, which allows tractable maximisation of the
joint action-value in off-policy learning, and guarantees consistency between
the centralised and decentralised policies. We evaluate QMIX on a challenging
set of StarCraft II micromanagement tasks, and show that QMIX significantly
outperforms existing value-based multi-agent reinforcement learning methods. (Camera-ready version, International Conference on Machine Learning 2018.)
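One practical payoff of the monotonicity constraint mentioned in this abstract is that the joint argmax decomposes: each agent greedily maximising its own utility also maximises Q_tot, so no search over the exponential joint action space is needed. A minimal sketch of this decentralised selection (ours, for illustration):

```python
import torch

def decentralised_argmax(agent_q_values):
    """Under a monotonic mixer, the joint action maximising Q_tot is
    obtained by letting each agent maximise its own utility vector.
    agent_q_values: list of per-agent tensors over that agent's actions.
    """
    return [int(torch.argmax(q)) for q in agent_q_values]

# Two agents with three actions each; each picks its own best action,
# and monotonic mixing guarantees this is also the argmax of Q_tot.
qs = [torch.tensor([0.1, 0.7, 0.2]), torch.tensor([0.4, 0.3, 0.9])]
print(decentralised_argmax(qs))  # [1, 2]
```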
Randomized Entity-wise Factorization for Multi-Agent Reinforcement Learning
Multi-agent settings in the real world often involve tasks with varying types
and quantities of agents and non-agent entities; however, common patterns of
behavior often emerge among these agents/entities. Our method aims to leverage
these commonalities by asking the question: ``What is the expected utility of
each agent when only considering a randomly selected sub-group of its observed
entities?'' By posing this counterfactual question, we can recognize
state-action trajectories within sub-groups of entities that we may have
encountered in another task and use what we learned in that task to inform our
prediction in the current one. We then reconstruct a prediction of the full
returns as a combination of factors considering these disjoint groups of
entities and train this ``randomly factorized'' value function as an auxiliary
objective for value-based multi-agent reinforcement learning. By doing so, our
model can recognize and leverage similarities across tasks to improve learning
efficiency in a multi-task setting. Our approach, Randomized Entity-wise
Factorization for Imagined Learning (REFIL), outperforms all strong baselines
by a significant margin in challenging multi-task StarCraft micromanagement
settings. (ICML 2021 camera-ready version.)
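The counterfactual sub-grouping at the heart of this method can be sketched schematically (ours, not the paper's code; all names and the placeholder value factors are assumptions). A random binary mask splits the observed entities into two disjoint groups, and the sum of value factors computed from each group alone is regressed onto the same target as the full value estimate:

```python
import torch

def random_entity_mask(n_entities: int, batch: int) -> torch.Tensor:
    """Sample a random binary mask splitting observed entities into two
    disjoint sub-groups; the complement of the mask defines the second
    group. Illustrative sketch only.
    """
    return torch.randint(0, 2, (batch, n_entities)).float()

# Schematic auxiliary objective: in practice the per-group factors would
# come from the value network with masked (e.g. attention-masked)
# entity observations; placeholders stand in for them here.
mask = random_entity_mask(n_entities=6, batch=4)
q_group_a = torch.randn(4, 1, requires_grad=True)  # placeholder factor
q_group_b = torch.randn(4, 1, requires_grad=True)  # placeholder factor
td_target = torch.randn(4, 1)                      # placeholder target
aux_loss = ((q_group_a + q_group_b) - td_target).pow(2).mean()
aux_loss.backward()
```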